REGRESSION AS A MACHINE:

An Introduction to Machine Learning

Marco R. Steenbergen

University of Zurich

What Is Machine Learning?

Of Experiences, Tasks, and Performance

Definition

The design of algorithms that learn from data and are capable of improvement as new experiences emerge. Specifically, the algorithm “is said to learn from experience \(E\) with respect to some task \(T\) and performance measure \(P\), if its performance at task \(T\), as measured by \(P\), improves with experience \(E\)” (Mitchell, 1997, p. 2).

A Continuum of Machine Learning Algorithms

The poles are:

  1. Unsupervised machine learning:
    • Does not require labeled outcomes.
  2. Supervised machine learning:
    • Requires labeled outcomes.
    • These could be
      • Classes, i.e., classification tasks.
      • Numerical scores, i.e., regression tasks.

Machine Learning in the Social Sciences

The main uses of machine learning in (applied) social science are the:

  1. Automation of tedious, time-consuming, and error-prone coding tasks.
  2. Prediction of behavior (also in applied work).
  3. Induction: Developing new theories from patterns in data (Breiman, 2001; also Korb, 2004).

Essentials of Supervised Machine Learning

A Different Style of Data Science

  • The typical approach in the social sciences involves
    • Imposing a model on the data.
    • Drawing inferences across the whole sample.
    • A primary focus on interpretation.
  • Supervised machine learning involves
    • Learning a model from the data.
    • Dividing the data into test and training sets.
    • A primary focus on prediction.

A Different Jargon

Traditional

  • Data

  • Unit

  • Variable

  • Outcome

Machine Learning

  • Set

  • Sample, example, or instance

  • Feature

  • Label

The Generic Learning Problem

Model

An algorithm learns the model

\[ y_i = \phi \left(\boldsymbol{x}_i,\boldsymbol{\theta} \right) + \varepsilon_i \qquad(1)\]

  • \(i\) is the instance.

  • \(y\) is the label (a class or a score).

  • \(\phi(.)\) is an unknown function.

  • \(\boldsymbol{x}\) is the vector of predictive features.

  • \(\boldsymbol{\theta}\) are the unknown (tuning) parameters.

  • \(\varepsilon\) is irreducible error.

The Generic Prediction Task

Prediction

\[ \psi_i = f \left( \boldsymbol{x}_i^*, \boldsymbol{q} \right) \qquad(2)\]

  • \(\psi\) is the predicted value.

  • \(f(.)\) is the learnt functional form.

  • \(\boldsymbol{x}^*\) is the subset of relevant features \(\Rightarrow\) feature selection.

  • \(\boldsymbol{q}\) contains the optimal parameter values.

The Loss Function

  • In statistics, loss = error.

  • For at least one instance, \(\psi_i \neq y_i\).

  • The loss function quantifies the totality of error in the scalar \(L \left( \psi_i,y_i \right)\).

  • This quantity plays a crucial role in the learning process, which selects \(f(.)\) and \(\boldsymbol{q}\) in such a way that loss is minimized.

  • Later, we shall see how this is done.

Performance

  • Performance metrics measure how well the model accounts for the data and are given by \(P \left( \psi_i, y_i \right)\).

  • In some cases, they are derived directly from the loss function; in other cases, they follow their own logic (see Japkowicz & Shah, 2011).

  • Regardless, our goal is to achieve outstanding predictive performance (but see Thomas & Uminsky, 2020).

A Division of Labor

  • Training = the process of learning the functional form, parameters, and relevant features.

  • Evaluation = the process of assessing model performance.

  • Cardinal rule: Never train and evaluate on the same data!

  • Instead, we generate:

    • A training set for training purposes

    • A test set for evaluation purposes

The Split Sample Approach

  • If necessary, randomize the \(n\) instances.

  • Set aside a fraction \(p\) of the instances for training \(\Rightarrow\) training set of size \(n_1 = p \cdot n\).

  • Set aside the remaining fraction \(1-p\) for evaluation \(\Rightarrow\) test set of size \(n_2 = n - n_1\).
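As a sketch, the split can be done in base R by permuting row indices; here `df` is a placeholder for any data frame of instances:

```r
# Split-sample sketch in base R; `df` is a hypothetical data frame of n instances
set.seed(123)                                   # make the randomization reproducible
n   <- nrow(df)
p   <- 0.6                                      # fraction reserved for training
idx <- sample(seq_len(n), size = floor(p * n))  # randomize the instances
training_set <- df[idx, ]                       # n1 = p * n
test_set     <- df[-idx, ]                      # n2 = n - n1
```

In practice we will use rsample's `initial_split`, which adds conveniences such as stratification.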

Not by Performance Alone

  • Although predictive performance remains of paramount importance, interpretation is becoming ever more important in machine learning.

  • In the words of Ribeiro et al. (2016, p. 1), “if the users do not trust a model or a prediction, they will not use it.”

  • Interpretation can be easier or more complex, depending on the nature of the algorithm.

The Machine Learning Workflow

The Workflow in tidymodels

Source: RPubs

Theory of Regression Analysis

Preliminary Remarks

  • The familiar regression model can be cast as a machine learning algorithm.

  • It is limited to regression tasks.

  • It is also somewhat atypical:

    • \(\phi(.)\) is imposed rather than learnt.

    • There are no tuning parameters.

    • There is no feature selection.

  • Because of its simplicity, however, it is a useful didactic starting point.

The Model

Specification

\[ y_i = \beta + \sum_{j=1}^P \omega_j x_{ij} + \varepsilon_i \qquad(3)\]

  • \(\beta\) is the bias (a.k.a. constant or intercept).

  • \(\omega\) is the weight (a.k.a. partial slope coefficient).

Geometric Interpretation

Hyperplane

Feature Space

Note: Figures are based on the iris data, specifically the Versicolor subset. X1 = sepal width, X2 = petal width, and y = sepal length.

Loss Function

  • For regression tasks, several loss functions can be formulated.

  • Two common functions are:

    • Absolute loss: A.k.a. the \(L_1\)-norm, the goal is to minimize \(\sum_{i=1}^{n_1} \lvert \psi_i - y_i \rvert\).

    • Quadratic loss: A.k.a. the (squared) \(L_2\)-norm, the goal is to minimize \(\sum_{i=1}^{n_1} \left( \psi_i - y_i \right)^2\).

  • Quadratic loss is the foundation of ordinary least squares.

  • Absolute loss is used to minimize the impact of outliers.
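For a handful of hypothetical predictions, the two losses can be computed directly:

```r
# Hypothetical predictions psi and labels y
psi <- c(5.1, 6.2, 4.8)
y   <- c(5.0, 6.0, 5.3)
sum(abs(psi - y))    # absolute (L1) loss: dampens the impact of outliers
sum((psi - y)^2)     # quadratic (L2) loss: the basis of ordinary least squares
```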

Performance

For regression tasks, three metrics are commonly computed over the test data:

  1. Mean absolute error: \(\text{MAE} = \frac{1}{n_2} \sum_{i=1}^{n_2} \lvert \psi_i - y_i \rvert\).
  2. Root mean squared error: \(\text{RMSE} = \sqrt{\frac{1}{n_2} \sum_{i=1}^{n_2} \left( \psi_i - y_i \right)^2}\).
  3. R-Squared: \(R^2 = r_{\psi,y}^2\).
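The three metrics are easily sketched in base R for hypothetical test-set vectors:

```r
# Hypothetical test-set predictions psi and labels y
psi  <- c(5.1, 6.2, 4.8, 6.9)
y    <- c(5.0, 6.0, 5.3, 7.1)
mae  <- mean(abs(psi - y))          # mean absolute error
rmse <- sqrt(mean((psi - y)^2))     # root mean squared error
rsq  <- cor(psi, y)^2               # squared correlation between psi and y
```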

Interpretation

  • A typical approach is to show a partial dependence plot.

  • Here, we distinguish between

    • One or two focal predictive features, \(\boldsymbol{x}_f\).

    • Controls, \(\boldsymbol{x}_c\).

  • We average the predicted values across the controls

\[ f(\boldsymbol{x}_f,\boldsymbol{q}) = \mathbb{E}_{\boldsymbol{x}_c} \left[ f(\boldsymbol{x}_c,\boldsymbol{x}_f,\boldsymbol{q}) \right] = \int f(\boldsymbol{x}_c,\boldsymbol{x}_f,\boldsymbol{q}) dp(\boldsymbol{x}_c) \]

  • This is the equivalent of the (in)famous ceteris paribus condition.
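A partial dependence profile can be sketched by hand: fix the focal feature at a grid of values and average the predictions over the observed controls. Here `fit`, `df`, and the column `x_f` are hypothetical placeholders for a fitted model, its data, and the focal feature:

```r
# Manual partial dependence sketch; `fit`, `df`, and `x_f` are hypothetical
grid <- seq(min(df$x_f), max(df$x_f), length.out = 20)
pdp  <- sapply(grid, function(v) {
  tmp     <- df
  tmp$x_f <- v                      # hold the focal feature fixed at v
  mean(predict(fit, tmp))           # average over the empirical distribution of controls
})
plot(grid, pdp, type = "l")         # the partial dependence curve
```

Packages such as DALEX, used below, automate exactly this computation.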

Practice of Regression Analysis

Happiness around the Globe

  • 2023 World Happiness Report (Helliwell et al., 2023).

  • Gallup World Poll data.

  • Label: “Please imagine a ladder with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?”

  • Log per capita GDP.

  • Social support from friends in troubled times.

  • Healthy life expectancy.

  • Freedom to make life choices.

  • Generosity, i.e., charitable donations.

  • Perceptions of corruption.

Splitting the Sample Using rsample

library(rio)
library(tidyverse)
happy_df <- import("Data/whr23.dta")
row.names(happy_df) <- happy_df$country
happy_clean <- happy_df %>%
  select(happiness,logpercapgdp,socialsupport,healthylifeexpectancy,
         freedom,generosity,perceptionsofcorruption) %>%
  na.omit
library(tidymodels)
tidymodels_prefer()
set.seed(2922)
happy_split <- initial_split(happy_clean, prop = 0.6, strata = happiness)
happy_split
<Training/Testing/Total>
<80/56/136>
training_set <- training(happy_split)
test_set <- testing(happy_split)
ks.test(training_set$happiness, test_set$happiness)

    Exact two-sample Kolmogorov-Smirnov test

data:  training_set$happiness and test_set$happiness
D = 0.067857, p-value = 0.9947
alternative hypothesis: two-sided

Pre-Processing Using recipes

  • For a regression problem, little or no pre-processing is often needed.

  • The predictive feature per capita GDP is typically skewed and might require a transformation, but that was already done for us.

  • Hence, we skip this element for now and return to it shortly.

Training Using parsnip

  • We specify the algorithm, any tuning parameters, and the mode (i.e., task).

  • We set the external learning engine (= implementation of the algorithm) using set_engine.

  • We declare the model formula and training set using fit.

  • Available algorithms and engines are listed in parsnip's searchable model database.

happy_fit <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm") %>%
  fit(happiness ~ ., data = training_set)
tidy(happy_fit)
# A tibble: 7 × 5
  term                    estimate std.error statistic      p.value
  <chr>                      <dbl>     <dbl>     <dbl>        <dbl>
1 (Intercept)              -2.37      1.01      -2.35  0.0217      
2 logpercapgdp              0.155     0.0874     1.77  0.0808      
3 socialsupport             4.45      0.704      6.32  0.0000000185
4 healthylifeexpectancy     0.0326    0.0177     1.84  0.0691      
5 freedom                   1.91      0.627      3.04  0.00325     
6 generosity                0.209     0.410      0.508 0.613       
7 perceptionsofcorruption  -1.07      0.439     -2.45  0.0169      
glance(happy_fit)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.813         0.798 0.515      53.0 1.24e-24     6  -56.7  129.  148.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Evaluation Using yardstick

happy_fit %>%
  predict(test_set) %>%
  bind_cols(test_set)
# A tibble: 56 × 8
   .pred happiness logpercapgdp socialsupport healthylifeexpectancy freedom
 * <dbl>     <dbl>        <dbl>         <dbl>                 <dbl>   <dbl>
 1  5.34      5.33         9.30         0.855                  66.5   0.571
 2  6.00      6.02         9.96         0.891                  67.2   0.823
 3  6.68      6.86        10.8          0.915                  70.9   0.825
 4  3.45      4.37         8.10         0.437                  56.1   0.743
 5  5.71      5.63         9.62         0.880                  67.3   0.746
 6  4.09      4.64         7.67         0.663                  55.5   0.696
 7  4.18      4.97         8.22         0.686                  55.8   0.686
 8  7.01      6.96        10.8          0.929                  71.4   0.874
 9  6.07      6.33        10.1          0.889                  70.3   0.792
10  5.65      5.63         9.58         0.822                  69.3   0.804
# ℹ 46 more rows
# ℹ 2 more variables: generosity <dbl>, perceptionsofcorruption <dbl>
happy_fit %>%
  predict(test_set) %>%
  bind_cols(test_set) %>%
  metrics(truth = happiness, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.472
2 rsq     standard       0.855
3 mae     standard       0.368

Partial Dependence Plot

DALEX = Descriptive mAchine Learning EXplanations (Biecek, 2018)

library(DALEXtra)
happy_explainer <- explain_tidymodels(
  model = happy_fit,
  data = test_set,
  y = test_set$happiness
)
Preparation of a new explainer is initiated
  -> model label       :  model_fit  (  default  )
  -> data              :  56  rows  7  cols 
  -> target variable   :  56  values 
  -> predict function  :  yhat.model_fit  will be used (  default  )
  -> predicted values  :  No value for predict function target column. (  default  )
  -> model_info        :  package parsnip , ver. 1.1.0 , task regression (  default  ) 
  -> predicted values  :  numerical, min =  3.025659 , mean =  5.443398 , max =  7.56488  
  -> residual function :  difference between y and yhat (  default  )
  -> residuals         :  numerical, min =  -1.065165 , mean =  0.120227 , max =  1.130251  
  A new explainer has been created!  
pdp_gdp <- model_profile(
  happy_explainer,
  variables = "logpercapgdp",
  N = NULL
)
as_tibble(pdp_gdp$agr_profiles) %>%
  ggplot(aes(`_x_`, `_yhat_`)) +
  geom_line(linewidth = 1.2, alpha = 0.8, color = "#31688EFF") +
  geom_rug(sides = "b", col = "#31688EFF") +
  labs(
    x = "Log Per Capita GDP",
    y = "Predicted Happiness",
    color = NULL
  ) +
  theme_bw()

Hooray!

  • You just completed your first machine learning project.

  • However, we took a shortcut by working with a subset of the data, instead of the whole file.

  • Our code was not optimal since we did not create a workflow.

The Whole of the Data

happy_df <- import("Data/whr23.dta")
head(happy_df)
              country happiness stderr logpercapgdp socialsupport
1         Afghanistan     1.859  0.033        7.324         0.341
2             Albania     5.277  0.066        9.567         0.718
3             Algeria     5.329  0.062        9.300         0.855
4             Andorra        NA     NA           NA            NA
5              Angola        NA     NA           NA            NA
6 Antigua and Barbuda        NA     NA           NA            NA
  healthylifeexpectancy freedom generosity perceptionsofcorruption continent
1                54.712   0.382     -0.081                   0.847         3
2                69.150   0.794     -0.007                   0.878         4
3                66.549   0.571     -0.117                   0.717         1
4                    NA      NA         NA                      NA         4
5                    NA      NA         NA                      NA         1
6                    NA      NA         NA                      NA         2
  ISO3
1  AFG
2  ALB
3  DZA
4  AND
5  AGO
6  ATG

Compared to the earlier example,

  • There are missing data.

  • We have a column country identifying the country, which is not a predictive feature.

  • We have a column stderr, which is the standard error on the happiness scores and will be ignored.

  • We have a column continent, which looks numeric but is a factor.

  • We have a column ISO3 that is fully redundant with country.

  • Let us first:

    • Drop ISO3 and stderr.

    • Turn continent into a proper factor.

    • Drop missing values (since we have no information about them except for their location).

  • Next, we should create a recipe that

    • Creates dummies for continent, since that is what the regression needs.

    • Ensures that country will not be used as a predictor but remains available to flag instances.

Building a Recipe

work_df <- happy_df %>%
  mutate(continent = factor(continent,
                            labels = c("Africa", "Americas", "Asia",
                                          "Europe", "Oceania"))) %>%
  select(-c(stderr, ISO3)) %>%
  na.omit
set.seed(1560)
happy_split <- initial_split(work_df, prop = 0.6)
training_set <- training(happy_split)
test_set <- testing(happy_split)
happy_recipe <- 
  recipe(happiness ~ ., data = training_set) %>%
  update_role(country, new_role = "id") %>%
  step_dummy(continent)
tidy(happy_recipe)
# A tibble: 1 × 6
  number operation type  trained skip  id         
   <int> <chr>     <chr> <lgl>   <lgl> <chr>      
1      1 step      dummy FALSE   FALSE dummy_WZw1Q

How Does recipes Process Data?

  1. The recipe call screens the data for types (numeric, factors, strings, and Booleans), as well as roles (e.g., outcome or predictor).
  2. Any estimation is performed on the data specified in the recipe. If this is the training set, then a consequence is that the test set will never be touched—as should be the case.
  3. Estimates from the training set will then be used to inform operations in the test set; nothing new will be estimated in the test set—again, as should be the case.
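This logic is visible in recipes' `prep()` and `bake()` pair; as a sketch with the objects created above:

```r
# prep() estimates all steps on the training set; bake() only applies those estimates
prepped     <- prep(happy_recipe, training = training_set)
baked_train <- bake(prepped, new_data = NULL)      # the processed training set
baked_test  <- bake(prepped, new_data = test_set)  # reuses training-set estimates; nothing re-estimated
```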

Building a Workflow

  • Workflows are a very efficient way to build machine learning projects.

  • They recognize that modeling is not just about fitting a model, but also about various pre- and post-processing steps.

  • As such, workflows include elements from parsnip, recipes, and possibly other packages.

A Regression Workflow

happy_model <-
  linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")
happy_flow <- 
  workflow() %>%
  add_model(happy_model) %>%
  add_recipe(happy_recipe)
happy_fit <- fit(happy_flow, training_set)
tidy(happy_fit)
# A tibble: 11 × 5
   term                    estimate std.error statistic    p.value
   <chr>                      <dbl>     <dbl>     <dbl>      <dbl>
 1 (Intercept)              -2.23      1.27      -1.76  0.0832    
 2 logpercapgdp              0.152     0.0944     1.61  0.111     
 3 socialsupport             3.50      0.724      4.83  0.00000766
 4 healthylifeexpectancy     0.0263    0.0205     1.28  0.204     
 5 freedom                   2.72      0.626      4.34  0.0000474 
 6 generosity                0.751     0.486      1.55  0.127     
 7 perceptionsofcorruption  -0.579     0.449     -1.29  0.201     
 8 continent_Americas        0.334     0.240      1.40  0.167     
 9 continent_Asia           -0.0611    0.205     -0.298 0.767     
10 continent_Europe          0.280     0.250      1.12  0.267     
11 continent_Oceania         0.163     0.556      0.293 0.771     

Inside Optimization Theory

Considerations

  • Our training problem was simple:

    • Few training instances.

    • Few predictive features.

    • A simple loss function.

  • This means we can do everything using calculus.

  • Most machine learning problems are complex with lots of data and difficult loss functions.

  • We want optimizers that

    • Are simple in terms of the calculus involved.

    • Can be evaluated numerically.

    • Require us to process little data at a time.

  • These notes show how this can be achieved.

Sample Data

  • Consider a single standardized feature \(x\) and a standardized label \(y\).

  • The model to be learnt is \(\psi_i = w x_i\).

  x        y
  0.543    0.671
 -0.917   -0.682
  1.458    1.421
 -0.267   -0.896
 -0.817   -0.514

Minimizing Loss Through Calculus

  • We seek \(w = \text{argmin}_w \, L_2\).

  • The calculus recipe is:

    • Obtain the gradient: \(\nabla = \frac{dL_2}{dw}\).

    • Set to 0 and solve for \(w\).

  • This yields \(w = 0.930\).
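For the sample data, the calculus works out as follows:

\[ L_2 = \sum_{i=1}^{5} \left( w x_i - y_i \right)^2, \qquad \nabla = \frac{dL_2}{dw} = 2 \sum_{i=1}^{5} \left( w x_i - y_i \right) x_i \]

Setting \(\nabla = 0\) and solving gives

\[ w = \frac{\sum_i x_i y_i}{\sum_i x_i^2} = \frac{3.721}{4.000} = 0.930 \]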

But …

We can bypass much of the math and summation and will show this in two steps:

  1. Discovering the minimum through a hill-descending numeric algorithm.
  2. Performing the minimization on small batches of instances or even single samples.

The result is some form of gradient descent—the computational workhorse of machine learning.

Batch Gradient Descent

  1. Make an initial guess of \(\boldsymbol{\theta}\); we call this the starting value.

  2. Update so that the \(t\)th estimate is

    \[ \boldsymbol{\theta}^t = \boldsymbol{\theta}^{t-1} - \alpha \boldsymbol{\nabla}^{t-1} \] where \(\alpha\) is the learning rate.

  3. Repeat (iterate) until convergence.
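The three steps can be sketched for the model \(\psi_i = w x_i\) under quadratic loss:

```r
# Batch gradient descent for psi = w * x on the sample data
x <- c(0.543, -0.917, 1.458, -0.267, -0.817)
y <- c(0.671, -0.682, 1.421, -0.896, -0.514)
w     <- 1.1                          # step 1: starting value
alpha <- 0.1                          # learning rate
for (t in 1:100) {                    # step 3: iterate
  grad <- 2 * sum((w * x - y) * x)    # gradient of the quadratic loss
  w    <- w - alpha * grad            # step 2: update
}
round(w, 3)                           # approaches the calculus solution, 0.93
```

A fixed iteration count stands in for a proper convergence check (e.g., stopping once the update falls below a tolerance).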

Illustration

  • \(w^0 = 1.1\).

  • \(\alpha = 0.1\).

  x        y        error     x × error
  0.543    0.671    -0.074    -0.040
 -0.917   -0.682    -0.327     0.300
  1.458    1.421     0.183     0.267
 -0.267   -0.896     0.602    -0.161
 -0.817   -0.514    -0.385     0.314

  old w    gradient   adjustment   new w
  1.1      1.359      0.136        0.964

Here error \(= \psi_i - y_i\) and adjustment \(= \alpha \times\) gradient.

Going Stochastic

  • Batch gradient descent bypasses the problem of analytically solving \(\boldsymbol{\nabla} = \boldsymbol{0}\).

  • However, we still have to sum over all of the data and, with big data, this can be slow.

  • What if we could sample a single instance and update on that basis \(\rightarrow\) stochastic gradient descent.

  • Recipe:

    1. Set the starting values.

    2. Randomly select an instance from the training set and update using \(\boldsymbol{\theta}^t = \boldsymbol{\theta}^{t-1} - \alpha \boldsymbol{\nabla}_i^{t-1}\).

    3. Make many passes through the training data; each pass is called an epoch.

    4. Repeat this process until convergence has been reached.
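The recipe can be sketched on the sample data, one instance per update:

```r
# Stochastic gradient descent: update on one randomly drawn instance at a time
x <- c(0.543, -0.917, 1.458, -0.267, -0.817)
y <- c(0.671, -0.682, 1.421, -0.896, -0.514)
set.seed(42)
w <- 1.1; alpha <- 0.1                       # starting value and learning rate
for (epoch in 1:50) {                        # each pass through the data is an epoch
  for (i in sample(seq_along(x))) {          # random order within the epoch
    grad_i <- 2 * (w * x[i] - y[i]) * x[i]   # per-instance gradient
    w <- w - alpha * grad_i                  # update
  }
}
```

Because each update uses a single instance, \(w\) oscillates around the optimum rather than settling exactly on it.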

Illustration 1

  • Let \(w^0 = 1.1\) and \(\alpha = 0.1\).

  • We select the 3rd instance with \(x_3 = 1.458\) and \(y_3 = 1.421\).

  • For this observation, \(\nabla_3^0 = 2 \cdot (1.1 \cdot 1.458 - 1.421) \cdot 1.458 = 0.533\).

  • Consequently, \(w^1 = 1.1 - 0.1 \cdot 0.533 = 1.047\).

Illustration 2

SGD Simulation

Mini Batch Gradient Descent

  • Batch gradient descent uses all training instances in an iteration:

    • Advantage: Fast convergence.

    • Disadvantage: Requires sums.

  • Stochastic gradient descent uses one training instance at a time:

    • Advantage: No sums.

    • Disadvantage: Slower convergence.

  • A compromise—mini batch gradient descent—is to utilize small batches of data:

    • \(1 < \text{batch size} \ll n_1\).
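Continuing the sketch, a mini-batch version with batch size 2 (a value chosen purely for illustration):

```r
# Mini-batch gradient descent: small random batches per update
x <- c(0.543, -0.917, 1.458, -0.267, -0.817)
y <- c(0.671, -0.682, 1.421, -0.896, -0.514)
set.seed(42)
w <- 1.1; alpha <- 0.1; batch_size <- 2
for (epoch in 1:50) {
  batches <- split(sample(seq_along(x)), ceiling(seq_along(x) / batch_size))
  for (b in batches) {
    grad <- 2 * sum((w * x[b] - y[b]) * x[b])   # gradient over the batch only
    w    <- w - alpha * grad
  }
}
```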

Optimizing the Optimizer

  1. Adaptive learning rates:
    • Take smaller steps the closer one gets to the optimum.
    • Adaptive moment estimation (adam) is one of the better choices (Ruder, 2017).
    • Others include RMSprop and adadelta.
  2. Exits from local minima:
    • Complex loss functions may have local minima.
    • We want to avoid being trapped there.
    • One way to exit local minima is to add some momentum, for instance, using Nesterov accelerated gradients (Ruder, 2017).
    • These can be combined with adaptive learning rates, e.g., nadam.

References

Biecek, P. (2018). DALEX: Explainers for Complex Predictive Models in R. Journal of Machine Learning Research, 19(84), 1–5.
Breiman, L. (2001). Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199–231. DOI: 10.1214/ss/1009213726
Helliwell, J.F., Layard, R., Sachs, J.D., De Neve, J.-E., Aknin, L.B., & Wang, S. (2023). World Happiness Report 2023 (p. 160). Hall Green, United Kingdom: Wellbeing International.
Japkowicz, N., & Shah, M. (2011). Evaluating Learning Algorithms: A Classification Perspective. Cambridge: Cambridge University Press. https://www.cambridge.org/core/books/evaluating-learning-algorithms/3CB22D16AB609D1770C24CA2CB5A11BF
Korb, K.B. (2004). Introduction: Machine Learning as Philosophy of Science. Minds and Machines, 14(4), 433–440. DOI: 10.1023/B:MIND.0000045986.90956.7f
Mitchell, T.M. (1997). Machine Learning. New York: McGraw-Hill.
Ribeiro, M.T., Singh, S., & Guestrin, C. (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. DOI: 10.48550/ARXIV.1602.04938
Ruder, S. (2017). An Overview of Gradient Descent Optimization Algorithms. arXiv:1609.04747 [Cs]. http://arxiv.org/abs/1609.04747
Thomas, R., & Uminsky, D. (2020). The Problem with Metrics Is a Fundamental Problem for AI. DOI: 10.48550/ARXIV.2002.08512